Model#

The model module provides classes for different language models used in the LMCSC (Language Model-based Corrector with Semantic Constraints) system. It includes a base class LMModel and several specific model implementations.

Key Components#

  • LMModel: Base class for language models.

  • QwenModel: Class for Qwen language models.

  • LlamaModel: Class for Llama language models.

  • BaichuanModel: Class for Baichuan language models.

  • InternLM2Model: Class for InternLM2 language models.

  • UerModel: Class for UER language models.

  • ChatLMModel and the Chat* variants (ChatQwenModel, ChatLlamaModel, ChatBaichuanModel, ChatInternLM2Model, ChatUerModel): Chat-prompted counterparts of the above classes.

  • AutoLMModel: Factory class for automatically selecting and instantiating the appropriate language model.

LMModel#

The LMModel class serves as the base class for all language models in the LMCSC system. It provides common functionality and interfaces for working with different types of language models.

Key Features:#

  • Initialization with pre-trained models

  • Tokenization and vocabulary management

  • Beam search preparation and output processing

  • Model parameter counting

Model-Specific Classes#

The module provides specific implementations for various language models:

  • QwenModel: Optimized for Qwen models, with support for FlashAttention2.

  • LlamaModel: Tailored for Llama models, with specific tokenization and padding strategies.

  • BaichuanModel: Designed for Baichuan models, with custom token handling.

  • InternLM2Model: Specialized for InternLM2 models.

  • UerModel: Adapted for UER models, with specific output processing.

Each of these classes inherits from LMModel and overrides certain methods to accommodate the unique characteristics of their respective model architectures.
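For illustration, support for a new architecture could be added the same way. The sketch below is hypothetical: the overridden method names come from the LMModel API documented later on this page, but the attribute names and token handling are assumptions rather than the library's actual code.

from lmcsc.model import LMModel

class MyNewModel(LMModel):
    # Hypothetical subclass showing which base-class hooks must be overridden.

    def set_decoder_start_token_id(self):
        # Assumption: take the start token from the tokenizer.
        self.decoder_start_token_id = self.tokenizer.bos_token_id

    def set_vocab_size(self):
        # Assumption: the tokenizer length defines the vocabulary size.
        self.vocab_size = len(self.tokenizer)

    def set_convert_ids_to_tokens(self):
        # Assumption: reuse the tokenizer's own conversion function.
        self.convert_ids_to_tokens = self.tokenizer.convert_ids_to_tokens

    def get_model_kwargs(self):
        # Each built-in subclass returns its model-specific generation kwargs here.
        return {}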

AutoLMModel#

The AutoLMModel class provides a convenient way to instantiate the appropriate language model based on the model name or path. It automatically selects the correct model class and initializes it with the given parameters.

Example:

from lmcsc.model import AutoLMModel

# Create a Qwen model
qwen_model = AutoLMModel.from_pretrained("qwen-7b")

# Create a Llama model
llama_model = AutoLMModel.from_pretrained("llama-7b")

# Create a Baichuan model
baichuan_model = AutoLMModel.from_pretrained("Baichuan2-7B-Base")

This factory pattern allows for easy integration of new model types and simplifies the process of working with different language models within the LMCSC system.
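Internally, such a factory typically dispatches on the model name. The following is a minimal sketch of that pattern, not the library's actual implementation; only the model classes and the ValueError behavior are taken from the API documentation below.

from lmcsc.model import QwenModel, LlamaModel, BaichuanModel

def select_model_class(model: str):
    # Sketch of name-based dispatch; the real matching rules may differ.
    name = model.lower()
    if "qwen" in name:
        return QwenModel
    if "llama" in name:
        return LlamaModel
    if "baichuan" in name:
        return BaichuanModel
    raise ValueError(f"Unsupported model type: {model}")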

API Documentation#

class lmcsc.model.LMModel(model: str, attn_implementation: str | None = None, *args, **kwargs)[source]#

Bases: object

A base class for language models.

Parameters:
  • model (str) – The name or path of the pre-trained model.

  • attn_implementation (str, optional) – The attention implementation to use. Defaults to None.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

model_name#

The name of the model.

Type:

str

model#

The loaded language model.

Type:

AutoModelForCausalLM

tokenizer#

The tokenizer for the model.

Type:

AutoTokenizer

vocab#

The vocabulary of the model.

Type:

dict

is_byte_level_tokenize#

Whether the tokenization is byte-level.

Type:

bool

set_decoder_start_token_id()[source]#

Sets the decoder start token ID.

Raises:

NotImplementedError – This method should be implemented by subclasses.

set_vocab_size()[source]#

Sets the vocabulary size.

Raises:

NotImplementedError – This method should be implemented by subclasses.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function.

Raises:

NotImplementedError – This method should be implemented by subclasses.

decorate_model_instance()[source]#

Decorates the model instance with additional attributes and settings.

get_model_kwargs()[source]#

Gets the model-specific keyword arguments.

Raises:

NotImplementedError – This method should be implemented by subclasses.

prepare_beam_search_inputs(src: List[str], contexts: List[str] | None = None, prompt_split: str = '\n', n_beam: int = 8, n_beam_hyps_to_keep: int = 1)[source]#

Prepares inputs for beam search.

Parameters:
  • src (List[str]) – The source sentences.

  • contexts (List[str], optional) – The context for each source sentence. Defaults to None.

  • prompt_split (str, optional) – The prompt split token. Defaults to "\n".

  • n_beam (int, optional) – The number of beams. Defaults to 8.

  • n_beam_hyps_to_keep (int, optional) – The number of beam hypotheses to keep. Defaults to 1.

Returns:

A tuple containing model_kwargs, context_input_ids, context_attention_mask, and beam_scorer.

Return type:

tuple
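As a usage sketch (assuming model is an LMModel instance, e.g. one returned by AutoLMModel.from_pretrained, and with an illustrative input sentence), the returned tuple can be unpacked directly:

src = ["今天天器很好"]  # noisy input sentence (illustrative)
model_kwargs, context_input_ids, context_attention_mask, beam_scorer = \
    model.prepare_beam_search_inputs(src, n_beam=8, n_beam_hyps_to_keep=1)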

prepare_prompted_inputs(src: List[str])[source]#

Prepares prompted inputs for generation.

Parameters:
  • src (List[str]) – The source sentences.

Returns:

A tuple containing model_kwargs, context_input_ids, context_attention_mask, and beam_scorer.

Return type:

tuple

process_generated_outputs(outputs, contexts: List[str] | None = None, prompt_split: str = '\n', n_beam_hyps_to_keep: int = 1, need_decode: bool = True)[source]#

Processes the generated outputs.

Parameters:
  • outputs – The generated outputs.

  • contexts (List[str], optional) – The context for each output. Defaults to None.

  • prompt_split (str, optional) – The prompt split token. Defaults to "\n".

  • n_beam_hyps_to_keep (int, optional) – The number of beam hypotheses to keep. Defaults to 1.

  • need_decode (bool, optional) – Whether to decode the outputs. Defaults to True.

Returns:

The processed predictions.

Return type:

List[List[str]]
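A matching sketch for the decoding side, assuming outputs comes from a generation call driven by the inputs prepared above:

# predictions is a List[List[str]]: one list of candidate corrections per source sentence
predictions = model.process_generated_outputs(outputs, n_beam_hyps_to_keep=1)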

get_n_parameters()[source]#

Returns the number of parameters in the model in a human-readable format.

Returns:

The number of parameters in a human-readable format.

Return type:

str
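For example (the exact output format is an assumption; the method only guarantees a human-readable string):

from lmcsc.model import AutoLMModel

model = AutoLMModel.from_pretrained("qwen-7b")
print(model.get_n_parameters())  # e.g. a string such as "7.7B"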

class lmcsc.model.ChatLMModel(model: str, attn_implementation: str | None = None, *args, **kwargs)[source]#

Bases: LMModel

A base class for chat-prompted language models.

prepare_prompted_inputs(src: List[str])[source]#

Prepares prompted inputs for generation for chat models.

Parameters:
  • src (List[str]) – The source sentences.

Returns:

A tuple containing model_kwargs, context_input_ids, context_attention_mask, and beam_scorer.

Return type:

tuple

class lmcsc.model.QwenModel(model, *args, **kwargs)[source]#

Bases: LMModel

A class for Qwen language models.

Parameters:
  • model (str) – The name or path of the pre-trained Qwen model.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

set_decoder_start_token_id()[source]#

Sets the decoder start token ID for Qwen models.

set_vocab_size()[source]#

Sets the vocabulary size for Qwen models.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function for Qwen models.

get_model_kwargs()[source]#

Gets the model-specific keyword arguments for Qwen models. Unlike other models, Qwen uses <|endoftext|> as both the eos_token and the pad_token, and it uses DynamicCache for past_key_values.

Returns:

A dictionary of keyword arguments.

Return type:

dict
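A hypothetical sketch of what such a dictionary could contain, based only on the notes above (the actual keys and values are defined by the implementation, and the checkpoint name is illustrative):

from transformers import AutoTokenizer, DynamicCache

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")  # illustrative checkpoint
eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")

model_kwargs = {
    "eos_token_id": eos_id,
    "pad_token_id": eos_id,             # Qwen reuses <|endoftext|> as the pad token
    "past_key_values": DynamicCache(),  # Qwen uses DynamicCache for past_key_values
}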

class lmcsc.model.ChatQwenModel(model, *args, **kwargs)[source]#

Bases: ChatLMModel, QwenModel

class lmcsc.model.LlamaModel(model, *args, **kwargs)[source]#

Bases: LMModel

A class for Llama language models.

Parameters:
  • model (str) – The name or path of the pre-trained Llama model.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

set_decoder_start_token_id()[source]#

Sets the decoder start token ID for Llama models.

set_vocab_size()[source]#

Sets the vocabulary size for Llama models.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function for Llama models.

prepare_beam_search_inputs(src: List[str], contexts: List[str] | None = None, prompt_split: str = '\n', n_beam: int = 8, n_beam_hyps_to_keep: int = 1)[source]#

Prepares inputs for beam search for Llama models.

Parameters:
  • src (List[str]) – The source sentences.

  • contexts (List[str], optional) – The context for each source sentence. Defaults to None.

  • prompt_split (str, optional) – The prompt split token. Defaults to "\n".

  • n_beam (int, optional) – The number of beams. Defaults to 8.

  • n_beam_hyps_to_keep (int, optional) – The number of beam hypotheses to keep. Defaults to 1.

Returns:

A tuple containing model_kwargs, context_input_ids, context_attention_mask, and beam_scorer.

Return type:

tuple

get_model_kwargs()[source]#

Gets the model-specific keyword arguments for Llama models.

Returns:

A dictionary of keyword arguments.

Return type:

dict

class lmcsc.model.ChatLlamaModel(model, *args, **kwargs)[source]#

Bases: ChatLMModel, LlamaModel

class lmcsc.model.BaichuanModel(model, *args, **kwargs)[source]#

Bases: LMModel

A class for Baichuan language models.

Parameters:
  • model (str) – The name or path of the pre-trained Baichuan model.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

set_decoder_start_token_id()[source]#

Sets the decoder start token ID for Baichuan models.

set_vocab_size()[source]#

Sets the vocabulary size for Baichuan models.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function for Baichuan models.

get_model_kwargs()[source]#

Gets the model-specific keyword arguments for Baichuan models.

Returns:

A dictionary of keyword arguments.

Return type:

dict

class lmcsc.model.ChatBaichuanModel(model, *args, **kwargs)[source]#

Bases: ChatLMModel, BaichuanModel

class lmcsc.model.InternLM2Model(model, *args, **kwargs)[source]#

Bases: LMModel

A class for InternLM2 language models.

Parameters:
  • model (str) – The name or path of the pre-trained InternLM2 model.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

set_decoder_start_token_id()[source]#

Sets the decoder start token ID for InternLM2 models.

set_vocab_size()[source]#

Sets the vocabulary size for InternLM2 models.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function for InternLM2 models.

get_model_kwargs()[source]#

Gets the model-specific keyword arguments for InternLM2 models.

Returns:

A dictionary of keyword arguments.

Return type:

dict

class lmcsc.model.ChatInternLM2Model(model, *args, **kwargs)[source]#

Bases: ChatLMModel, InternLM2Model

class lmcsc.model.UerModel(model, *args, **kwargs)[source]#

Bases: LMModel

A class for UER language models.

Parameters:
  • model (str) – The name or path of the pre-trained UER model.

  • *args – Variable length argument list.

  • **kwargs – Arbitrary keyword arguments.

set_decoder_start_token_id()[source]#

Sets the decoder start token ID for UER models.

set_vocab_size()[source]#

Sets the vocabulary size for UER models.

set_convert_ids_to_tokens()[source]#

Sets the convert_ids_to_tokens function for UER models.

get_model_kwargs()[source]#

Gets the model-specific keyword arguments for UER models.

Returns:

A dictionary of keyword arguments.

Return type:

dict

process_generated_outputs(outputs, contexts: List[str] | None = None, prompt_split: str = '\n', n_beam_hyps_to_keep: int = 1)[source]#

Processes the generated outputs for UER models.

Parameters:
  • outputs – The generated outputs.

  • contexts (List[str], optional) – The context for each output. Defaults to None.

  • prompt_split (str, optional) – The prompt split token. Defaults to "\n".

  • n_beam_hyps_to_keep (int, optional) – The number of beam hypotheses to keep. Defaults to 1.

Returns:

The processed predictions.

Return type:

List[List[str]]

class lmcsc.model.ChatUerModel(model, *args, **kwargs)[source]#

Bases: ChatLMModel, UerModel

class lmcsc.model.AutoLMModel[source]#

Bases: object

A factory class for automatically selecting and instantiating the appropriate language model.

This class provides a static method to create instances of specific language model classes based on the model name or path provided.

static from_pretrained(model: str, use_chat_prompted_model: bool = False, *args, **kwargs)[source]#

Creates and returns an instance of the appropriate language model class based on the model name.

Parameters:
  • model (str) – The name or path of the pre-trained model.

  • use_chat_prompted_model (bool, optional) – Whether to instantiate the chat-prompted variant of the selected model class. Defaults to False.

  • *args – Variable length argument list to be passed to the model constructor.

  • **kwargs – Arbitrary keyword arguments to be passed to the model constructor.

Returns:

An instance of the appropriate language model class.

Return type:

LMModel

Raises:

ValueError – If an unsupported model type is specified.
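
Example (model names are illustrative):

from lmcsc.model import AutoLMModel

# Request the chat-prompted variant of a model
chat_model = AutoLMModel.from_pretrained("qwen-7b", use_chat_prompted_model=True)

# An unrecognized model name raises ValueError
try:
    AutoLMModel.from_pretrained("some-unknown-model")
except ValueError as err:
    print(err)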